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Abstract 

This thesis presents "cache-oblivious" algorithms that use asymptotically optimal 
amounts of work, and move data asymptotically optimally among multiple levels 
of cache. An algorithm is cache oblivious if no program variables dependent on 
hardware configuration parameters, such as cache size and cache-line length need 
to be tuned to minimize the number of cache misses. 

We show that the ordinary algorithms for matrix transposition, matrix multi- 
plication, sorting, and Jacobi-style multipass filtering are not cache optimal. We 
present algorithms for rectangular matrix transposition, FFT, sorting, and multi- 
pass filters, which are asymptotically optimal on computers with multiple levels 
of caches. For a cache with size Z and cache-line length L, where Z = 0(L 2 ), 
the number of cache misses for an m x n matrix transpose is 8(1 + mn/E). The 
number of cache misses for either an n-point FFT or the sorting of n numbers is 
0(1 + (n/L)(l + log z n)). The cache complexity of computing n time steps of a 
Jacobi-style multipass filter on an array of size n is 6(1 + n j L + n 2 j 'ZE). We also 
give an 8(mnp)-work algorithm to multiply an m x n matrix by an n x p matrix 
that incurs @{m + n + p + [mn + np + mp)/E + mnp jEyfZ) cache misses. 

We introduce an "ideal-cache" model to analyze our algorithms, and we prove 
that an optimal cache-oblivious algorithm designed for two levels of memory is 
also optimal for multiple levels. We further prove that any optimal cache-oblivious 
algorithm is also optimal in the previously studied HMM and SUMH models. Al- 
gorithms developed for these earlier models are perforce cache-aware: their be- 
havior varies as a function of hardware-dependent parameters which must be 
tuned to attain optimality. Our cache-oblivious algorithms achieve the same as- 
ymptotic optimality on all these models, but without any tuning. 

Thesis Supervisor: Charles E. Leiserson 

Title: Professor of Computer Science and Engineering 
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Section 1 



Introduction 



Resource-oblivious algorithms that nevertheless use resources efficiently offer ad- 
vantages of simplicity and portability over resource-aware algorithms whose re- 
source usage must be programmed explicitly In this thesis, we study cache re- 
sources, specifically, the hierarchy of memories in modern computers. We exhibit 
several "cache-oblivious" algorithms that use cache as effectively as "cache-aware" 
algorithms. 

Before discussing the notion of cache obliviousness, we introduce the (Z, L) 
ideal-cache model to study the cache complexity of algorithms. This model, which 
is illustrated in Figure 1-1, consists of a computer with a two-level memory hier- 
archy consisting of an ideal (data) cache of Z words and an arbitrarily large main 
memory. Because the actual size of words in a computer is typically a small, fixed 
size (4 bytes, 8 bytes, etc.), we shall assume that word size is constant; the par- 
ticular constant does not affect our asymptotic analyses. The cache is partitioned 
into cache lines, each consisting of L consecutive words that are always moved 
together between cache and main memory. Cache designers typically use L > 1, 
banking on spatial locality to amortize the overhead of moving the cache line. We 
shall generally assume in this thesis that the cache is tall: 

Z = Q(L 2 ), (1.1) 

which is usually true in practice. 

The processor can only reference words that reside in the cache. If the refer- 
enced word belongs to a line already in cache, a cache hit occurs, and the word is 
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Figure 1-1: The ideal-cache model 



delivered to the processor. Otherwise, a cache miss occurs, and the line is fetched 
into the cache. The ideal cache is fully associative [24, Ch. 5]: Cache lines can be 
stored anywhere in the cache. If the cache is full, a cache line must be evicted. The 
ideal cache uses the optimal off-line strategy of replacing the cache line whose next 
access is farthest in the future [7], and thus it exploits temporal locality perfectly. 

An algorithm with an input of size n is measured in the ideal-cache model 
in terms of its work complexity W{n) — its conventional running time in a RAM 
model [4] — and its cache complexity Q[n;Z,L) — the number of cache misses it 
incurs as a function of the size Z and line length L of the ideal cache. When Z and 
L are clear from context, we denote the cache complexity as simply Q{n) to ease 
notation. 

We define an algorithm to be cache aware if it contains parameters (set at ei- 
ther compile-time or runtime) that can be tuned to optimize the cache complexity 
for the particular cache size and line length. Otherwise, the algorithm is cache 
oblivious. Historically, good performance has been obtained using cache-aware 
algorithms, but we shall exhibit several cache-oblivious algorithms for fundamen- 
tal problems that are asymptotically as efficient as their cache-aware counterparts. 

To illustrate the notion of cache awareness, consider the problem of multiply- 
ing two n x n matrices A and B to produce their n x n product C. We assume 
that the three matrices are stored in row-major order, as shown in Figure 2-l(a). 
We further assume that n is "big," i.e., n > L, in order to simplify the analysis. 
The conventional way to multiply matrices on a computer with caches is to use 
a blocked algorithm [22, p. 45]. The idea is to view each matrix M as consist- 
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ing of [n/s] x [n/s] submatrices M !; - (the blocks), each of which has size s x s, 
where s is a tuning parameter. The following algorithm implements this strategy: 

Block-Mult (A, B, C, n) 

1 for i 4- 1 to n/s 

2 do for j 4- 1 to n/s 

3 do for k <- 1 to n/s 

4 do Ord-Mult [Afc, Bkj, Qj,s) 

where ORD-MULT (A, B,C,s) is a subroutine that computes C <- C + AB on s x s 
matrices using the ordinary 0(s 3 ) algorithm (see Section 7.1). (This algorithm as- 
sumes for simplicity that s evenly divides n. In practice, s and n need have no 
special relationship, which yields more complicated code in the same spirit.) 

Depending on the cache size of the machine on which BLOCK-MULT is run, 
the parameter s can be tuned to make the algorithm run fast, and thus BLOCK- 
Mult is a cache-aware algorithm. To minimize the cache complexity, we choose 
s as large as possible such that the three s x s submatrices simultaneously fit in 
cache. An s x s submatrix is stored on 0(s + s 2 /L) cache lines. From the tall- 
cache assumption (1.1), we can see that s = @[y/Z). Thus, each of the calls to 
ORD-MULT runs with at most Z/L = 6(s + s 2 /L) cache misses needed to bring 
the three matrices into the cache. Consequently, the cache complexity of the entire 
algorithm is 9(n + n 2 /L + [n / \[Zf (Z / '!)) = &{n + n 2 /L + n 3 /L\fZ), since the 
algorithm must read n 2 elements, which reside on \n 2 /L] cache lines. 

The same bound can be achieved using a simple cache-oblivious algorithm that 
requires no tuning parameters such as the s in BLOCK-MULT. We present such an 
algorithm, which works on general rectangular matrices, in Section 2. The prob- 
lems of computing a matrix transpose and of performing an FFT also succumb to 
remarkably simple algorithms, which are described in Section 3. Cache-oblivious 
sorting poses a more formidable challenge. In Sections 4 and 5, we present two 
sorting algorithms, one based on mergesort and the other on distribution sort, both 
of which are optimal. Section 6 compares an optimal recursive algorithm with an 
"ordinary" iterative algorithm, both of which compute a multipass filter over one- 
dimensional data. It also provides some brief empirical results for this problem. In 
Section 7, we show that the ordinary algorithms for matrix transposition, matrix 
multiplication, and sorting are not cache optimal. 

The ideal-cache model makes the perhaps-questionable assumption that mem- 
ory is managed automatically by an optimal cache replacement strategy. Although 
the current trend in architecture does favor automatic caching over programmer- 
specified data movement, Section 8 addresses this concern theoretically. We show 
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that the assumptions of two hierarchical memory models in the literature, in which 
memory movement is programmed explicitly, are actually no weaker than ours. 
Specifically, we prove (with only minor assumptions) that optimal cache-oblivious 
algorithms in the ideal-cache model are also optimal in the hierarchical memory 
model (HMM) [1] and in the serial uniform memory hierarchy (SUMH) model 
[5, 42]. Section 9 discusses related work, and Section 10 offers some concluding 
remarks. 

Many of the results in this thesis are based on a joint paper [21] coauthored by 
Matteo Frigo, Charles E. Leiserson, and Sridhar Ramachandran. 
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Section 2 



Matrix multiplication 



This section describes and analyzes an algorithm for multiplying an m x n matrix 
by an n x p matrix cache-obliviously using @{mnp) work and incurring 0(m + 
n + p + [mn + np + mp)/L + mnp jL\/Z] cache misses. These results require the 
tall-cache assumption (1.1) for matrices stored in row-major layout format, but the 
assumption can be relaxed for certain other layouts. We also show that Strassen's 
algorithm [38] for multiplying n x n matrices, which uses 

@{n l °S2 7 ) work, mcurs 

8(1 + n 2 /L + n l0 ^ 7 /L^) cache misses. 

The following algorithm extends the optimal divide-and-conquer algorithm for 
square matrices described in [9] to rectangular matrices. To multiply an m x n ma- 
trix A by an n x p matrix B, the algorithm halves the largest of the three dimensions 
and recurs according to one of the following three cases: 

*B = ( A A B = ( A 'l) , (2.!, 



A 2 J \A 2 B, 

AB = (A 1 A 2 ) (^j =A 1 B 1 +A 2 B 2 , (2.2) 

AB = A (Bi B 2 ) = (AB 1 AB 2 ) . (2.3) 

In case (2.1), we have m > max{n, p}. Matrix A is split horizontally, and both 
halves are multiplied by matrix B. In case (2.2), we have n > max{m, p}. Both 
matrices are split, and the two halves are multiplied. In case (2.3), we have p > 
max {m,n}. Matrix B is split vertically, and each half is multiplied by A. For square 
matrices, these three cases together are equivalent to the recursive multiplication 
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Figure 2-1: Layout of a 16 x 16 matrix in (a) row major, (b) column major, (c) 4 x 4-blocked, 
and (d) bit-interleaved layouts. 

algorithm described in [9]. The base case occurs when m = n = p = 1, in which 
case the two elements are multiplied and added into the result matrix. 

Although this straightforward divide-and-conquer algorithm contains no tun- 
ing parameters, it uses cache optimally. To analyze the algorithm, we assume that 
the three matrices are stored in row-major order, as shown in Figure 2-1 (a). In- 
tuitively, the cache-oblivious divide-and-conquer algorithm uses the cache effec- 
tively, because once a subproblem fits into the cache, its smaller subproblems can 
be solved in cache with no further cache misses. 

Theorem 1 The cache-oblivious matrix multiplication algorithm uses Q[mnp) work and 
incurs @{m + n + p+ [mn + np + mp) /L + mnp / L-s/Z) cache misses when multiplying 
anm x nby ann x p matrix. 

Proof. It can be shown by induction that the work of this algorithm is @{mnp). 
To analyze the cache misses, let oc be a constant sufficiently small that three sub- 
matrices of size m' x n' , n' x p' , and m' x p' , where max{m', n' , p'} < oc^pL, all fit 
completely in the cache. We distinguish the following four cases cases depending 
on the initial size of the matrices. 
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Case I: m,n,p > ocy/Z. 

This case is the most intuitive. The matrices do not fit in cache, since all 
dimensions are "big enough." The cache complexity of matrix multiplication 
can be described by the recurrence 

@{{mn + np + mp)/L) if (ran + np + mp) < ocZ , 
2Q[m/2,n, p) + 0(1) otherwise and if ra > n and m > p , 
2Q{m,n/2, p) + 0(1) otherwise and if n > ra and n > p , 
2Q[m,n,p/2) +0(1) otherwise. 

(2.4) 

The base case arises as soon as all three submatrices fit in cache. The total 
number of lines used by the three submatrices is @{{mn + np + mp) /L). The 
only cache misses that occur during the remainder of the recursion are the 
0( (ran + np + mp) j L) cache misses required to bring the matrices into cache. 
In the recursive cases, when the matrices do not fit in cache, we pay for the 
cache misses of the recursive calls, which depend on the dimensions of the 
matrices, plus 0(1) cache misses for the overhead of manipulating submatri- 
ces. The solution to this recurrence is Q{m,n, p) = ©{mnp/L^/Z). 

Case II: (m < oc\[Z and n,p > aVZ) OR (m < oc^/Z and n,p > ocy/Z) OR (p < 
ay/Z and m,n > a^/Z). 

Here, we shall present the case where m < <x\/Z and n,p > <x-\fZ. The proofs 
for the other cases are only small variations of this proof. The multiplication 
algorithm always divides n or p by 2 according to cases (2.2) and (2.3). At 
some point in the recursion, both are small enough that the whole problem 
fits into cache. The number of cache misses can be described by the recur- 
rence 

{8(1 + n + np/L + m) if n, p G [<xy/Z/2, <xy/Z] , 
2Q[m,n/2, p) + 0(1) otherwise and if n > p , 
2Q[m,n,p/2) +0(1) otherwise. 

The solution to this recurrence is @{np/L + mnp/Ly/Z). 

Case III: (n, p < ay/Z and m > oc\/Z) OR (m, p < oc\[Z and n > oc\/Z) OR 
(m, n < ccy/Z and p > ocy/Z). 

In each of these cases, one of the matrices fits into cache, and the others do 
not. Here, we shall present the case where n,p < tx\J~Z and m > <x\[Z. The 
other cases can be proven similarly. The multiplication algorithm always 



Q{m,n,p) < 
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divides m by 2 according to case (2.1). At some point in the recursion, m is 
in the range ay/Z/2 < m < oc\/Z, and the whole problem fits in cache. The 
number cache misses can be described by the recurrence 



Q{m,n) < 



8(1 + ra) if m e [oc y/Z/2, cx^/Z] , 

2Q{m/2,n,p) + 0(1) otherwise ; 



whose solutionis Q{m,n,p) = Q{m + mnp /L\/Z). 

Case IV: m,n,p < cx\/Z. 

From the choice of oc, all three matrices fit into cache. The matrices are stored 
on8(l +mn/L + np/L + mp/L) cache lines. Therefore, we have Q{m,n,p) = 
8(1 + (ran + np +mp)/L). □ 



We require the tall-cache assumption (1.1) in these analyses, because the matri- 
ces are stored in row-major order. Tall caches are also needed if matrices are stored 
in column-major order (Figure 2-l(b)), but the assumption that Z = H(L 2 ) can be 
relaxed for certain other matrix layouts. The s x s-blocked layout (Figure 2-1 (c)), 
for some tuning parameter s, can be used to achieve the same bounds with the 
weaker assumption that the cache holds at least some sufficiently large constant 
number of lines. The cache-oblivious bit-interleaved layout (Figure 2-1 (d)) has the 
same advantage as the blocked layout, but no tuning parameter need be set, since 
submatrices of size 8(\/L x \/~L) are cache-obliviously stored on one cache line. 
The advantages of bit-interleaved and related layouts have been studied in [18] 
and [12, 13]. One of the practical disadvantages of bit-interleaved layouts is that 
index calculations on conventional microprocessors can be costly. 

For square matrices, the cache complexity Q{n) = 8(n + n 2 /L + n 3 /L\/Z) of the 
cache-oblivious matrix multiplication algorithm is the same as the cache complex- 
ity of the cache-aware BLOCK-MULT algorithm and also matches the lower bound 
by Hong and Kung [25]. This lower bound holds for all algorithms that execute 
the 8 (n 3 ) operations given by the definition of matrix multiplication 

n 

Cij = Y_ aikbkj ■ 

k=l 

No tight lower bounds for the general problem of matrix multiplication are known. 

By using an asymptotically faster algorithm, such as Strassen's algorithm [38] 
or one of its variants [45], both the work and cache complexity can be reduced. 
When multiplying ft x ft matrices, Strassen's algorithm, which is cache oblivious, 
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requires only 7 recursive multiplications of n/2 x n/2 matrices and a constant 
number of matrix additions, yielding the recurrence 



Q(n) < 



8(1 + n + n 2 /L) 
7Q(n/2) + 0(n 2 



if n 2 < ocZ , 
/L) otherwise; 



(2.5) 



where a is a sufficiently small constant. The solution to this recurrence is 0(n + 

n 2 /L + n lo S2 7 /LVZ). 

Summary 

In this section we have used the ideal-cache model to analyze two algorithms 
for matrix multiplication. We have described an efficient cache-oblivious algo- 
rithm for rectangular matrix multiplication and analyzed the cache complexity of 
Strassen's algorithm. 
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Section 3 



Matrix transposition and FFT 



This section describes an optimal cache-oblivious algorithm for transposing an 
m x n matrix. The algorithm uses &[mn) work and incurs 8(1 + mn/L) cache 
misses. Using matrix transposition as a subroutine, we convert a variant [44] of 
the "six-step" fast Fourier transform (FFT) algorithm [6] into an optimal cache- 
oblivious algorithm. This FFT algorithm uses O(nlgn) work and incurs 0(1 + 
(n/L) (1 + log z n) ) cache misses. 

The problem of matrix transposition is defined as follows. Given anmxn ma- 
trix stored in a row-major layout, compute and store A 1 into an n x m matrix B also 
stored in a row-major layout. The straightforward algorithm for transposition that 
employs doubly nested loops incurs Q{mn) cache misses on one of the matrices 
when mn ^> Z, which is suboptimal. 

Optimal work and cache complexities can be obtained with a divide-and-con- 
quer strategy, however. If n > m, we partition 

A = {A 1 A 1 ), B =(^)- 

Then, we recursively execute TRANSPOSE [A\, B x ) and TRANSPOSE [A 2 , B 2 ). Alter- 
natively, if m > n, we divide matrix A horizontally and matrix B vertically and 
likewise perform two transpositions recursively. The next two theorems provide 
upper and lower bounds on the performance of this algorithm. 

Theorem 2 The cache-oblivious matrix-transpose algorithm involves &[mn) work and 
incurs 0(1 + mn / L) cache misses for anm xn matrix. 
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Proof. That the algorithm uses 8 (ran) work can be shown by induction. For the 

cache analysis, let Q(ra, n) be the cache complexity of transposing an ra x n matrix. 

We assume that the matrices are stored in row-major order, the column-major case 

having a similar analysis. 

Let a be a constant sufficiently small that two submatrices of size ra' x n' and 

n' x ra', where max{m',n'} < aL, fit completely in the cache. We distinguish the 

following three cases. 
Case I: max{m, n) < aL. 

Both the matrices fit in O (1 ) + 2ran /L lines. From the choice of a, the number 
of lines required is at most Z/L, which implies Q{m,n) = 8(1 + mn/L). 

Case II: m < aL < n OR n < aL < ra. 

For this case, we assume without loss of generality that m < aL < n. The 
case n < aL < ra is analogous. The transposition algorithm divides the 
greater dimension n by 2 and performs divide-and-conquer. At some point 
in the recursion, n is in the range aL/2 < n < aL, and the whole problem fits 
in cache. Because the layout is row-major, at this point the input array has 
n rows, ra columns, and it is laid out in contiguous locations, thus requiring 
at most 0(1 + nm/L) cache misses to be read. The output array consists of 
nra elements in ra rows, where in the worst case every row lies on a different 
cache line. Consequently, we incur at most 0(ra + nm/L) for writing the 
output array. Since n > aL/2, the total cache complexity for this base case is 
0(l+ra). 

These observations yield the recurrence 



whose solutionis Q{m,n) = 8(1 + mn/L). 

Case III: ra, n > aL. 

As in Case II, at some point in the recursion, both n and ra fall in the interval 
[aL / 2, aL] . The whole problem then fits into cache, and it can be solved with 
at most 0{m + n + mn / L) cache misses. 

The cache complexity thus satisfies the recurrence 





whose solutionis Q{m,n) = 8(1 +mn/L). 



□ 
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Theorem 3 The cache-oblivious matrix-transpose algorithm is asymptotically optimal. 



Proof. For an m x n matrix, the matrix-transposition algorithm must write to mn 
distinct elements, which occupy at least \mn /L] = £1(1 + mn / L) cache lines. □ 

As an example application of the cache-oblivious transposition algorithm, the 
rest of this section describes and analyzes a cache-oblivious algorithm for comput- 
ing the discrete Fourier transform of a complex array of n elements, where n is an 
exact power of 2. The basic algorithm is the well-known "six-step" variant [6, 44] of 
the Cooley-Tukey FFT algorithm [15]. By using the cache-oblivious transposition 
algorithm, however, we can make the FFT cache oblivious, and its performance 
matches the lower bound by Hong and Kung [25]. 

Recall that the discrete Fourier transform (DFT) of an array X of n complex 
numbers is the array Y given by 

n-l 

Y\i] = ^X[;W', (3.1) 

j=0 

where cv n = e ln ^l n is a primitive nth root of unity, and 0 < i < n. 

Many known algorithms evaluate Equation (3.1) in time 0{n lgn) for all inte- 
gers n [17]. In this thesis, however, we assume that n is an exact power of 2, and 
compute Equation (3.1) according to the Cooley-Tukey algorithm, which works re- 
cursively as follows. In the base case where n = 0(1), we compute Equation (3.1) 
directly. Otherwise, for any factorization n = n\n 2 of n, we have 



«2— 1 

Y[ii + i 2 nx\ = Y_ 

/2=0 



^_X[jin 2 + il\CUn[ lh ] CUn 
,1=0 



h)2 



cvn? 2 . (3.2) 



Observe that both the inner and outer summations in Equation (3.2) are DFT's. 
Operationally, the computation specified by Equation (3.2) can be performed by 
computing n 2 transforms of size Wi (the inner sum), multiplying the result by the 
factors CVn 11 ' 2 (called the twiddle factors [17]), and finally computing n\ transforms 
of size n 2 (the outer sum). 

We choose fti to be 2 TCig")/ 2 ! a nd ft 2 to be 2^ 1 8")/ 2 J. The recursive step then 
operates as follows: 

1. Pretend that the input is a row-major n\ x ft 2 matrix A. Transpose A in place, 
i.e., use the cache-oblivious algorithm to transpose A onto an auxiliary array 
B, and copy B back onto A. (If n\ = 2ft 2 , consider the matrix to be made up 
of records containing two elements.) 
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2. At this stage, the inner sum corresponds to a DFT of the n 2 rows of the trans- 
posed matrix. Compute these n 2 DFT's of size n\ recursively. Observe that, 
because of the previous transposition, we are transforming a contiguous ar- 
ray of elements. 

3. Multiply A by the twiddle factors, which can be computed on the fly with no 
extra cache misses. 

4. Transpose A in-place, so that the inputs to the next stage are arranged in 
contiguous locations. 

5. Compute n\ DFT's of the rows of the matrix, recursively. 

6. Transpose A in-place so as to produce the correct output order. 

It can be proven by induction that the work complexity of this FFT algorithm 
is 0[n lg n). We now analyze its cache complexity. The algorithm always operates 
on contiguous data, by construction. In order to simplify the analysis of the cache 
complexity, we assume a tall cache, in which case each transposition operation and 
the multiplication by the twiddle factors require at most 0{l+n/L) cache misses. 
Thus, the cache complexity satisfies the recurrence 



for a sufficiently small constant <x chosen such that a subproblem of size aZ fits in 
cache. This recurrence has solution 



which is asymptotically optimal for a Cooley-Tukey algorithm, matching the lower 
bound by Hong and Kung [25] when n is an exact power of 2. As with matrix mul- 
tiplication, no tight lower bounds for cache complexity are known for the general 
problem of computing the DFT. 

Summary 

In this section, we have described an optimal cache-oblivious algorithm for FFT. 
The basic algorithm is the well-known "six-step" variant [6, 44] of the Cooley- 
Tukey FFT algorithm [15]. By using an optimal cache-oblivious transposition al- 
gorithm, however, we can make the FFT cache oblivious, and its performance 
matches the lower bound by Hong and Kung [25]. 




Q(n)=0(l + (n/L)(l+log z n)) , 
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Section 4 



Funnelsort 



Although it is cache oblivious, algorithms like familiar two-way merge sort (see 
Section 7.3) are not asymptotically optimal with respect to cache misses. The Z- 
way mergesort mentioned by Aggarwal and Vitter [3] is optimal in terms of cache 
complexity, but it is cache aware. This section describes a cache-oblivious sorting 
algorithm called "funnelsort." This algorithm has an asymptotically optimal work 
complexity O(nlgn), as well as an optimal cache complexity 0(1 + [n/L)[l + 
log z n)) if the cache is tall. In Section 5, we shall present another cache-oblivious 
sorting algorithm based on distribution sort. 

Funnelsort is similar to mergesort. In order to sort a (contiguous) array of n 
elements, funnelsort performs the following two steps: 

1. Split the input into n 1 / 3 contiguous arrays of size n 2 / 3 , and sort these arrays 
recursively. 

2. Merge the n 1 / 3 sorted sequences using a n 1 / 3 -merger, which is described be- 
low. 

Funnelsort differs from mergesort in the way the merge operation works. Merg- 
ing is performed by a device called a k-merger, which inputs k sorted sequences 
and merges them. A ?c-merger operates by recursively merging sorted sequences 
that become progressively longer as the algorithm proceeds. Unlike mergesort, 
however, a fc-merger stops working on a merging subproblem when the merged 
output sequence becomes "long enough," and it resumes working on another 
merging subproblem. 
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Figure 4-1: Illustration of a fc-merger. A &:-merger is built recursively out of y/k left yfk- 
mergers L\, C2, . . . , C^, a series of buffers, and one right v^-merger 71. 

Since this complicated flow of control makes a fc-merger a bit tricky to describe, 
we explain the operation of the fc-merger pictorially Figure 4-1 shows a repre- 
sentation of a fc-merger, which has k sorted sequences as inputs. Throughout its 
execution, the fc-merger maintains the following invariant. 

Invariant The invocation of a k-merger outputs the first k 3 elements of the sorted sequence 
obtained by merging the k input sequences. 

A k-merger is built recursively out of v^-mergers in the following way. The k 
inputs are partitioned into y/k sets of y/k elements, and these sets form the input 
to the y/k left V^-mergers £ X/ £ 2/ . . . , £^ in the left part of the figure. The out- 
puts of these mergers are connected to the inputs of y/k buffers. Each buffer is a 
FIFO queue that can hold 2k 3 / 2 elements. Finally, the outputs of the buffers are 
connected to the y/k inputs of the right v^-merger 7Z in the right part of the figure. 
The output of this final y/k-merger becomes the output of the whole fc-merger. The 
reader should notice that the intermediate buffers are overdimensioned. In fact, 
each buffer can hold 2k 3 l 2 elements, which is twice the number k 3 l 2 of elements 
output by a \/fc-merger. This additional buffer space is necessary for the correct 
behavior of the algorithm, as will be explained below. The base case of the recur- 
sion is a fc-merger with k = 2, which produces k 3 = 8 elements whenever invoked. 

A fc-merger operates recursively in the following way. In order to output k 3 
elements, the fc-merger invokes 1Z k 3 l 2 times. Before each invocation, however, the 
fc-merger fills all buffers that are less than half full, i.e., all buffers that contain less 
than k 3 l 2 elements. In order to fill buffer i, the algorithm invokes the corresponding 
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left merger £, once. Since £, outputs k 3 l 2 elements, the buffer contains at least A 3 / 2 
elements after d finishes. 

In order to prove this result, we need three auxiliary lemmata. The first lemma 
bounds the space required by a fc-merger. 

Lemma 4 A k-merger can be laid out in 0{k 2 ) contiguous memory locations. 

Proof. A fc-merger requires O [k 2 ) memory locations for the buffers, plus the space 
required by the v^-mergers. The space S[k) thus satisfies the recurrence 

S(k) < (Vk + l)S(Vk)+0(k 2 ) , 

whose solution is S[k) = 0{k 2 ). □ 

It follows from Lemma 4, that a problem of size ay/Z can be solved in cache 
with no further cache misses, where oc is a sufficiently small constant. 

In order to achieve the bound on the number Q[n) of cache misses, it is im- 
portant that the buffers in a fc-merger be maintained as circular queues of size k. 
This requirement guarantees that we can manage the queue cache-efficiently, in 
the sense stated by the next lemma. 

Lemma 5 Performing r insert and remove operations on a circular queue causes 0(1 + 
r/L) cache misses if four cache lines are available for the buffer. 

Proof. Associate the two cache lines with the head and tail of the circular queue. 
The head- and tail-pointers are kept on two seperate lines. Since the replacement 
strategy is optimal, it will keep the frequently accessed pointers in cache. If a new 
cache line is read during an insert (delete) operation, the next L — 1 insert (delete) 
operations do not cause a cache miss. The result follows. □ 

Define Qm to be the number of cache misses incurred by a fc-merger. The next 
lemma bounds the number of cache misses incurred by a fc-merger. 

Lemma 6 On a tall cache, one invocation of a k-merger incurs 

Q M (k)=0(k + k 3 /L + k 3 \og z k/L) 

cache misses. 

Proof. There are two cases: either k < a^/Z ork > ocy/Z. 

Assume first that k < ay/Z. By Lemma 4, the data structure associated with 
the fc-merger requires at most 0{k 2 ) = 0(Z) contiguous memory locations. By the 
choice of a the fc-merger fits into cache. The fc-merger has k input queues, from 
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which it loads 0 [k 3 ) elements. Let r ; be the number of elements extracted from the 
zth input queue. Since k < ocyfZ and L = 0[y/Z), there are at least Z/L = Cl[k) 
cache lines available for the input buffers. Lemma 5 applies, whence the total 
number of cache misses for accessing the input queues is 

k 

^0(l+r i /L)=0{k + k 3 /L) . 

i=\ 

Similarly by Lemma 5, the cache complexity of writing the output queue is at most 
0(1 + k 3 /L). Finally for touching the 0{k 2 ) contiguous memory locations used by 
the internal data structures, the algorithm incurs at most O ( 1 + k 2 / L ) cache misses. 
The total cache complexity is therefore 

Q M (k) = 0(k + k 3 /L)+0(l+k 2 /L)+0(l+k 3 /L) 
= 0(k + k 3 /L) 

completing the proof of the first case. 

Assume now that k > ay/Z. In this second case, we prove by induction on k 
that whenever k > ocyfZ, we have 

Q M (k) < (ck 3 log z k)/L-A(k), (4.1) 

for some constant c > 0, where A{k) = k[l + (2clog z fc)/L) = o{k 3 ). The lower- 
order term A (k) does not affect the asymptotic behavior, but it makes the induction 
go through. This particular value of A[k) will be justified later in the analysis. 

The base case of the induction consists of values of k such that y/ocZ 1 ^ < k < 
ocyfZ. (It is not sufficient to just consider k = Q{y/Z), since k can become as small 
as 0(Z 1 / 4 ) in the recursive calls.) The analysis of the first case applies, yielding 
Q M {k) = 0(k + k 3 /L). Because k 2 > ay/Z = O(L) and k = Q(l), the last term 
dominates, and QmM = O (k 3 /L) holds. Consequently, a large enough value of c 
can be found that satisfies Inequality (4.1). 

For the inductive case, let k > ay/Z. The fc-merger invokes the v^-mergers 
recursively. Since xfocZ 1 ^ < yfk < k, the inductive hypothesis can be used to 
bound the number Qm( v 7 ^) of cache misses incurred by the submergers. The right 
merger 1Z is invoked exactly k 3 l 2 times. The total number / of invocations of left 
mergers is bounded by I < k 3 l 2 + 2y/k. To see why, consider that every invocation 
of a left merger puts k 3 l 2 elements into some buffer. Since k 3 elements are output 
and the buffer space is 2k 2 , the bound / < k 3 l 2 + 2y/k follows. 

Before invoking 1Z, the algorithm must check every buffer to see whether it 
is empty. One such check requires at most yfk cache misses, since there are yfk 
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buffers. This check is repeated exactly k 3 l 2 times, leading to at most k 2 cache misses 
for all checks. 

These considerations lead to the recurrence 



Q M (k) < (2k 3 / 2 + 2Vk) Q M (Vk)+k 2 



Application of the inductive hypothesis yields the desired bound Inequality (4.1), 
as follows. 



+k 2 



Q M (k) < (20 2 +2Vk)Q M (Vk)+k 2 

< 2{k 3 / 2 + Vk) 

< [ck 3 \og z k)/L +k 2 (l + (clog z fc) /L) 

-(2kV 2 + 2Vk)A(Vk). 

If A[k) = k{l + (2clog z fc) /L) (for example), we get 

Q M (k) < {ck 3 lo Sz k)/L + k 2 (l + (do Sz k)/L) 

- (2k 3 ' 2 + 2Vk) Vk (l + (2c\og z Vk) /l) 

< [ck 3 log z k) /L + k 2 (1 + (clog z fc) /L) 

-(2k 2 +2k) (1 + (clog z fc)/L) 

< [ck 3 log z k)/L (k 2 + 2k) (1 + (clog z fc) / L) 

< (cfc 3 log z fc)/L-A(fc) 



and Inequality (4.1) follows. 

It can be proven by induction that the work complexity of f unnelsort is O [n lg n 
The next theorem gives the cache complexity of funnelsort. 

Theorem 7 Funnelsort sorts n elements incurring at most Q{n) cache misses, where 

Q(n)=0(l + (n/L)(l+log z n)) . 



□ 



Proof. If n < aZ for a small enough constant a, then the funnelsort data structures 
fit into cache. To see why, observe that only one fc-merger is active at any time. 
The biggest fc-merger is the top-level n 1 / 3 -merger, which requires 0{n 2 ! 3 ) < 0{n) 
space. The algorithm thus can operate in 0(1 + n/L) cache misses. 
If n > aZ, we have the recurrence 

Q(n)=n 1 / 3 Q(n 2 / 3 ) + Q M (n 1 / 3 ). 
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By Lemma 6, we have Qm^ 1 / 3 ) = O (ft 1 / 3 + n/L + (ftlog z ft)/L). 

With the hypothesis Z = 0(L 2 ), we have n/L = flfft 1 / 3 ). Moreover, we also 
have n 1 / 3 = Q(l) and lgn = H(lgZ). Consequently, Qm(" 1//3 ) = 0 ((ftlog z ft)/L) 
holds, and the recurrence simplifies to 

Q(n) = n 1 /3 Q(n 2/ 3) + 0 ( (n i ogzn)/L ) . 

The result follows by induction on n. □ 

This upper bound matches the lower bound stated by the next theorem, prov- 
ing that funnelsort is cache-optimal. 

Theorem 8 The cache complexity of any sorting algorithm is 

Q(n) =£1(1 + (n/L) (l + log z n)) . 

Proof. Aggarwal and Vitter [3] show that there is an O ( (ft/ L)log Z/ , L (n/Z) ) bound 
on the number of cache misses made by any sorting algorithm on their "out-of- 
core" memory model, a bound that extends to the ideal-cache model. By applying 
the tall-cache assumption Z = n(L 2 ), we have 

Q(n) > a[n/L)\og z/L [n/Z) 

> fl (n/L)lg(n/Z)/(lgZ-lgL) 

> fl (n/L)lg(n/Z)/lgZ 

> a[n/L) Ign/lgZ — a[n/L) . 

It follows that Q[n) = O((n/L)log z n). The theorem can be proven by combining 
this result with the trivial lower bounds of Q[n) = £1(1) and Q[n) = D(n/L). □ 

Corollary 9 The cache-oblivious Funnelsort is asymptotically optimal. 

Proof. Follows from Theorems 8 and 7. □ 

Summary 

In this section we have presented an optimal cache-oblivious algorithm based on 
mergesort. Funnelsort uses a device called a fc-merger, which inputs k sorted se- 
quences and merges them in "chunks". It stops when the merged output becomes 
"long enough" to resume work on another subproblem. Further, we have shown 
that any sorting algorithm incurs at least O (1 + [n/L) (1 + log z n)) cache misses. 
This lower bound is matched by both our algorithms. 
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Section 5 



Distribution sort 



In this section, we describe a cache-oblivious optimal sorting algorithm based on 
distribution sort. Like the funnelsort algorithm from Section 4, the distribution- 
sorting algorithm uses 0[n lgn) work to sort n elements, and it incurs 

e(l + (n/L)(l+log z n)) 

cache misses if the cache is tall. Unlike previous cache-efficient distribution-sorting 
algorithms [1, 3, 30, 42, 44], which use sampling or other techniques to find the 
partitioning elements before the distribution step, our algorithm uses a "bucket- 
splitting" technique to select pivots incrementally during the distribution. 

Given an array A (stored in contiguous locations) of length n, the cache-oblivi- 
ous distribution sort sorts A as follows: 

1. Partition A into y 7 ^ contiguous subarrays of size ^Jn. Recursively sort each 
sub array. 

2. Distribute the sorted subarrays into q < ^fn buckets B\, B 2 ,...,B q of size n\, 
n 2 , ■ ■ ■ , n q , respectively, such that for i = 1, 2, . . . , q — 1, we have 

1. max{x | x G Bj} < min{x | x G B i+1 } , 

2. m < 2y/n . 

(See below for details.) 

3. Recursively sort each bucket. 

4. Copy the sorted buckets back to array A. 
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A stack-based memory allocator is used to exploit spatial locality. A nice prop- 
erty of stack based allocation is that memory is not fragmented for problems of 
small size. So if the space complexity of a procedure is S, only 0(1 + S/L) cache 
misses are made when S < Z, provided the procedure accesses only its local vari- 
ables. 



Distribution step 

The goal of Step 2 is to distribute the sorted subarrays of A into q buckets B\, 
B 2 , . . . , B q . The algorithm maintains two invariants. First, each bucket holds at 
most I^Jn elements at any time, and any element in bucket B, is smaller than any 
element in bucket B,- +1 . Second, every bucket has an associated pivot, a value 
which is greater than all elements in the bucket. Initially, only one empty bucket 
exists with pivot oo. At the end of Step 2, all elements will be in the buckets and 
the two conditions (a) and (b) stated in Step 2 will hold. 

The idea is to copy all elements from the subarrays into the buckets cache effi- 
ciently while maintaining the invariants. We keep state information for each sub- 
array and for each bucket. The state of a subarray consists of an index next of the 
next element to be read from the subarray and a bucket number bnum indicating 
where this element should be copied. By convention, bnum = oo if all elements in 
a subarray have been copied. The state of a bucket consists of the bucket's pivot 
and the number of elements currently in the bucket. 

We would like to copy the element at position next of a subarray to bucket 
bnum. If this element is greater than the pivot of bucket bnum, we would incre- 
ment bnum until we find a bucket for which the element is smaller than the pivot. 
Unfortunately, this basic strategy has poor caching behavior, which calls for a more 
complicated procedure. 

The distribution step is accomplished by the recursive procedure DISTRIBUTE. 
DISTRIBUTE [i, j,m) distributes elements from the zth through [i + m — l)th sub- 
arrays into buckets starting from By. Given the precondition that each subarray 
r = i, i + 1, . . . , i + m — 1 has its bnum[r] > j, the execution of DISTRIBUTE [i, j, m) 
enforces the postcondition that bnum[r] > j + m. Step 2 of the distribution sort in- 
vokes DISTRIBUTE (1, 1, y/n). The following is a recursive implementation of DIS- 
TRIBUTE: 
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Distribute^', 

1 if m = 1 

2 then CopyElems (z, /) 

3 else Distribute [i,j,m/2) 

4 Distribute [i + m/2,j,m/2) 

5 Distribute [i,j + m/2,m/2) 

6 Distribute (z + m/2,j + m/2,m/2) 



In the base case (line 1), the subroutine CopyElems (z, /) copies all elements from 
subarray i that belong to bucket /. If bucket ; has more than 2-s/n elements after the 
insertion, it can be split into two buckets of size at least y/n. For the splitting oper- 
ation, we use the deterministic median-finding algorithm [16, p. 189] followed by 
a partition. The next lemma shows that the median-finding algorithm uses O(m) 
work and incurs O (1 + m/L) cache misses to find the median of an array of size m. 
(In our case, we have m > 2yfn + 1.) In addition, when a bucket splits, all sub- 
arrays whose bnum is greater than the bnum of the split bucket must have their 
bnum's incremented. The analysis of DISTRIBUTE is given by the following two 
lemmata. 

Lemma 10 The median of m elements can be found cache-obliviously using 0{m) work 
and incurring O (1 + m/L) cache misses. 

Proof See [16, p. 189] for the linear-time median finding algorithm and the work 
analysis. The cache complexity is given by the same recurrence as the work com- 
plexity with a different base case. 



Lemma 11 Step 2 uses 0{n) work, incurs 0(1 + n /L) cache misses, and uses O [n] stack 
space to distribute n elements. 

Proof. In order to simplify the analysis of the work used by DISTRIBUTE, assume 
that CopyElems uses 0(1) work. We account for the work due to copying ele- 
ments and splitting of buckets separately. The work of DISTRIBUTE on m subarrays 
is described by the recurrence 




where a is a sufficiently small constant. The result follows. 



□ 



T[m) =4T(m/2) + 0(l) . 
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It follows that T[m) =0{m 2 ), where m = initially. 

We now analyze the work used for copying and bucket splitting. The number 
of copied elements is 0{n). Each element is copied exactly once and therefore the 
work due to copying elements is also 0{n). The total number of bucket splits is 
at most *Jn. To see why, observe that there are at most ^Jn buckets at the end of 
the distribution step, since each bucket contains at least elements. Each split 
operation involves 0{-\/n) work and so the net contribution to the work is 0{n). 
Thus, the total work used by DISTRIBUTE is W{n) = 0{T{y/n)) + 0{n) + 0{n) = 



For the cache analysis, we distinguish two cases. Let a be a sufficiently small 
constant such that the stack space used by sorting a problem of size ocZ, including 
the input array, fits completely into cache. 

Case I: n < ocZ. 

The input and the auxiliary space of size O (n ) fit into cache using O ( 1 + n /L ) 
cache lines. Consequently, the cache complexity is 0(1 + n/L). 

Case II: n > aZ. 

Let R(m, d) denote the cache misses incurred by an invocation of the subrou- 
tine DISTRIBUTE (i, /', m) that copies d elements from m subarrays to m buck- 
ets. We again account for the splitting of buckets separately. We first prove 
that R satisfies the following recurrence: 



where ^ 1<i<4 dj = d. 

First, consider the base case m < ah. An invocation of DISTRIBUTE (i, j,m) 
operates with m subarrays and m buckets. Since there are O(L) cache lines, 
the cache can hold all the auxiliary storage involved and the currently ac- 
cessed element in each subarray and bucket. In this case there are 0(L + 
d/L) cache misses. The initial access to each subarray and bucket causes 
0{m) = 0{L) cache misses. The cache complexity for copying the d elements 
from one set of contiguous locations to another set of contiguous locations is 
0(1 + d/L), which completes the proof of the base case. The recursive case, 
when m > <xL, follows immediately from the algorithm. The solution for 
Recurrence 5.1 is R{m,d) = 0(L + m 2 /L + d/L). 

We still need to account for the cache misses caused by the splitting of buck- 
ets. Each split causes 0(1 + ^/n/L) cache misses due to median finding 



0(n). 




(5.1) 
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(Lemma 10) and partitioning of y/n contiguous elements. An additional 
0(1 + \/n/L) misses are incurred by restoring the cache. As proven in the 
work analysis, there are at most yfh split operations. 

By adding R[y/n, n) to the complexity of splitting, we conclude that the total 
cache complexity of the distribution step is 0(L + n/L + y/n[l + y/n/L)) = 
()(/;//. . 

□ 

Theorem 12 Distribution sort uses O(nlgn) work and incurs 0(1 + (n/L) (1 + log z n)) 
cache misses to sort n elements. 

Proof. The work done by the algorithm is given by 

<? 

W{n) = v^Wlv^) + Y_ W ( n <) + 0 M ' 

i=i 

where q < yfn, each n, < 2yfn, and Y_]=\ n i = n - Th e solution to this recurrence is 
W[n) = O(nlgn). 

The space complexity of the algorithm is given by 

S{n) < S(2v^) + 0(n) . 

Each bucket has at most 2 y/n elements, thus the recursive call uses at S [2y/n] space 
and the O(n) term comes from Step 2. The solution to this recurrence is S[n) = 
0[n). 

The cache complexity of distribution sort is described by the recurrence 

{0[l+n/L) if n<aZ, 

Vn~Ql\/n~) + Y_ Qlni) + 0(1 + n/L) otherwise , 
i=i 

where a is a sufficiently small constant such that the stack space used by a sorting 
problem of size aZ, including the input array, fits completely in cache. The base 
case n < ocZ arises when both the input array A and the contiguous stack space 
of size S{n) = 0{n) fit in 0(1 + n/L) cache lines of the cache. In this case, the 
algorithm incurs 0(1 + n/L) cache misses to touch all involved memory locations 
once. In the case where n > aZ, the recursive calls in Steps 1 and 3 cause Q[y/n) + 
Y_] = i Q{ni) cache misses and 0(1 + n/L) is the cache complexity of Steps 2 and 4, 
as shown by Lemma 11. The theorem now follows by solving the recurrence. □ 

Corollary 13 The cache-oblivious distribution sort algorithm is asymptotically optimal. 
Proof. Follows from Theorems 8 and 12. □ 
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Summary 

In this section, we have presented another optimal cache-oblivious sorting algo- 
rithm, which is based on distribution sort. All previous cache-efficient distribution 
sort algorithms [1, 3, 30, 42, 44] are cache aware, since they are designed for caching 
models where the data is moved explicitly. They usually use a sampling processes 
to find the partitioning elements before the distribution step. Our algorithm finds 
the pivots incrementally during the distribution. 
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Section 6 



Jacobi multipass filter 



This section compares an optimal recursive algorithm with a more straightforward 
iterative algorithm, both which compute a multipass filter over one-dimensional 
data. When computing n generations on n elements, both algorithms use @{n 2 ) 
work. The iterative incurs @{n 2 /L) cache misses, if the data does not fit into the 
cache, where the recursive algorithm incurs only @{l + n / L + n 2 / ZL) cache misses 
which we prove to be cache optimal. We also provide some brief empirical results 
for this problem. The recursive algorithm executes in less than 70% of the time of 
the iterative algorithm for problem sizes that do not fit in L2-cache 

Consider the problem of a computing a multipass filter on an array A of size 
n, where a new value A| T+1 ' at generation r + 1 is computed from values at the 
previous step t according to some update rule. A typical update function is 



Applications of multipass filtering include the Jacobi iteration for solving heat- 
diffusion equations [31, p. 673] and the simulation of lattice gases with cellular 
automata. These applications usually deal with multidimensional data, but here, 
we shall explore the one-dimensional case for simplicity, even though caching ef- 
fects are often more pronounced with multidimensional data. 




(6.1) 
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JACOBI-ITER(A) 

1 n <— length of A 

2 fori <- 1 ton/2 

3 do for j <— 1 to n > Generation 2z 

4 do tmp[j] 4- (A[{j - 1) mod n] + A[;'] + A[{j + 1) mod n]) /3 

5 for / «— 1 to n > Generation 2z + 1 

6 do A[j] <- (tmp[[j — 1) mod n] + fmp[/] + tmp[[j + 1) mod n]) /3 
Figure 6-1: Iterative implementation of n-pass Jacobi update on array A with n elements. 

6.1 Iterative algorithm 

We first analyze the cache complexity of the straightforward implementation Ja- 
COBl-lTER of the update rule given in Equation (6.1). We show that this algorithm, 
shown in Figure 6-1, uses @[n) temporary storage and performs Q{n 2 ) memory 
accesses for an array of size n. If the array of size n does not fit into cache, the total 
number of cache misses is @{n 2 /L). 

To illustrate the order of updates of JACOBI-ITER on input A of size n, we view 
the computation of n generations of the multipass as a two-dimensional trace ma- 
trix T of size n x n. One dimension of T is the offset in the input array and the 
other dimension is the "generation" of the filtered result. The value of element 
74,2 is the value of array element A[2] at the 4th generation of the iterative algo- 
rithm. One row in the matrix represents the updates on one element in the array. 
The trace matrix of the iterative algorithm on a data array of size 16 is shown in 
Figure 6-2. The height of a bar represents the ordering of the updates, where the 
higher bars are updated later. The bigger the difference in the height of two bars, 
the further apart in time are their updates. If the height of a bar is not much bigger 
than the height of the bar directly in front of it, it is likely that the element is still in 
cache and a hit occurs. The height differences between two updates to the same el- 
ement in the iterative algorithm are all equal. Either the updates are close enough 
together that all updates are cache hits, or they are too far apart, and all updates 
are cache misses. 

Theorem 14 The JACOBI-ITER algorithm uses Q{n 2 ) work when computing n genera- 
tions on an array of size n. JACOBI-ITER incurs 0(1 + n/L) cache misses if the data fits 
into cache, and it incurs Q{n 2 /L) cache misses if the array does not fit into cache. 

Proof. Since there are two nested loops, each of which performs n iterations, the 
work is ®{n 2 ). 
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(a) 



241 242 243 244 245 246 247 248 249 250 

225 226 227 228 229 230 231 232 233 234 

209 210 211 212 213 214 215 216 217 218 

193 194 195 196 197 198 199 200 201 202 

177 178 179 180 181 182 183 184 185 186 

161 162 163 164 165 166 167 168 169 170 

145 146 147 148 149 150 151 152 153 154 

129 130 131 132 133 134 135 136 137 138 

113 114 115 116 117 118 119 120 121 122 

97 98 99 100 101 102 103 104 105 106 

81 82 83 84 85 86 87 88 89 90 

65 66 67 68 69 70 71 72 73 74 

49 50 51 52 53 54 55 56 57 58 

33 34 35 36 37 38 39 40 41 42 

17 18 19 20 21 22 23 24 25 26 
1 23456789 10 

•< data array A 



Figure 6-2: Ordering of updates of JACOBI-ITER on an array with 16 elements, (a) The 
trace matrix with the order of updates from 1 to 256. (b) A bar-graph illustrating the 
updates, where the height of a bar represents the ordering of updates and the smallest bar 
is updated first. 

We now analyze the cache misses of JACOBI-ITER on a (Z, L) ideal cache. Let a 
be a constant sufficiently small that ocZ elements fit into a cache of size Z. As long 
as the array and the temporary storage fit into cache, e.g., n < ocZ, the algorithm 
performs well. The cache complexity is only 0(1 + n/L) since the 0[n) elements 
are read (in order) only once. 

If the array has size n > ocZ, however, then it does not all fit in cache at one time. 
The optimal replacement strategy can keep at most O(Z) elements in cache. Thus, 
per iteration we have Q.(n/L — Z) updates which are cache misses. Consequently, 
the total number of cache misses is @{n)Q.(n / L — Z) = Q.(n 2 / L) for the n iterations. 

The algorithm can be optimized to use only 0(1) temporary storage and avoid 
the modulo computation, but the number of cache misses remains at Q[n 2 /L). □ 

6.2 Recursive algorithm 

In this section, we present an optimal recursive algorithm to compute an n-pass 
Jacobi filter. This cache-oblivious algorithm JACOBI-REC is sketched in Figure 6-3, 
where the input size is a power of 2. 1 We prove that the work used by JACOBI- 

x The algorithm for the general case is slightly more complicated. 
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JacobiA(A, n, s,w,r) 

1 if w > 2 

2 then jACOBlA(A,n / s / w/2,T) 

3 jACOBiA(A,n, [s + w/2),w/2,r) 

4 jACOBlV(A,n, (s + w/4) + \,w/2,r) 

5 jACOBlA(A,n, [s + w/4),w/2,t + w/4) 

6 else pi — t mod 2 

7 q <- (t+ 1) mod 2 

8 A [p] [s mod n] 

9 (A[^][(s-1) modnl +A[^][smodnl +A[g][(s + 1) modnl)/3 

10 A[p][{s + 1) mod n] 

11 (A[ij][smodn] + A[g][(s + 1) mod n] +A[g][(s + 2) modn])/3 

JacobiV(A, n, s,w,r) 

1 if zt> > 2 

2 then JacobiV (A, n,s + w/4,w/2,r) 

3 jACOBlA(A,n,s + w/4 / w/2,t + w/4) 

4 jACOBlV(A / n / s / w/2,T + zf/4) 

5 jACOBlV(A,n,s + w/2,w/2,t + w/4) 

Jacobi-Rec(A) 

1 n •(— length of A 

2 jACOBlA(A,n,0,n,0) 

3 jACOBlV(A,n,n/2,n,0) 

4 jACOBlA(A,n,n/2,n,n/2) 

5 jACOBlV(A,n,0 / n / n/2) 

Figure 6-3: The recursive implementation of the multipass filter on array A of size n, where 
n is a power of 2. The algorithm uses two auxiliary subroutines Jacobi A(A, n, s, w, r) and 
jACOBlV(A n,s,w, t). The input is in A[0] and A[l] is the 0[n) auxiliary space. The 
parameters s and w specify the position and size of the computed triangle and r is the 
generation of the lowest level of the triangle. JACOBI-REC {A) is the initial call. 
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(a) (b) (c) 




data data data 

Figure 6-4: (a) Decomposition of trace matrix in four triangles by JACOBI-REC. Triangles 2 
and 3 "wrap around" since array positions are computed modulo n. (b) shows the decom- 
position used by JACOBI A and (c) shows the decomposition used by JACOBI V. 



Rec is 0(n 2 ) and the cache complexity is 0(1 + n/L + n 2 /ZL), even if the problem 
does not fit into cache, which is a factor of Z fewer cache misses than the iterative 
method. 

In order to simplify the description, we describe the recursive algorithm as if 
the whole trace matrix would be computed. It turns out that in practice one auxil- 
iary array of size n suffices to compute the n steps on an array of size ft. 

The divide-and-conquer algorithm divides the trace matrix into 4 triangles, 
which are recursively divided into smaller triangles, as shown in Figure 6-4(a). 
Two auxiliary functions JACOBI A and JACOBI V are used to implement the recur- 
sion. jACOBlA(A, n, s, w, t) computes an "upper triangle" of the trace matrix A 
of size n, where the base of the triangle has size w and starts at s with generation 
t. It recursively computes up to w /2 generations ahead as shown in Figure 6-4(b). 
Analogously, JACOBIV computes a lower triangle recursively as shown in Fig- 
ure 6-4(c). The resulting trace matrix for an array of size 16 is shown in Figure 6-5. 
It illustrates the locality of the recursive algorithms. The triangles of the decom- 
position are clearly visible. Depending on the cache size, triangles of different size 
fit entirely into cache, which are then computed without any further cache misses. 
Although JACOBI-REC computes the elements in different order than JACOBI-ITER, 
it computes exactly the same values as JACOBI-ITER. 
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Figure 6-5: Ordering of updates of Jacobi-Rec on an array with 16 elements, (a) The 
trace matrix with the order of updates from 1 to 256. (b) A bar-graph illustrating the 
updates, where the height of a bar represents the ordering of updates and the smallest bar 
is updated first. 



Theorem 15 The recursive JACOBI-REC algorithm involves Q(n 2 ) work and incurs 0(1 + 
n/L + n 2 /ZL) cache misses when computing n generations on n elements. 

Proof. To simplify the analysis, we assume that n is an exact power of 2. 2 The 
work of JACOBI-REC can be described by three recurrences: 

W(n) = 2W A (n/2)+2W v (tt/2) + 0(l), 
W A (n) = 3W A (n/2) + W v (n/2)+0(l), 
W v (n) = 3W v (n/2) + W A (n/2) + 0(l); 

where W A and W v are the work used by the recursive procedures jACOBlA and 
JACOBIV. The solution for the total work is W[n) = 0(n 2 ), which is the same as 
the work of the iterative algorithm. 

The number Q(n) of cache misses incurred by a subproblem of size n is de- 
scribed by three recurrences: 

Q(n) = 2Q A (n/2)+2Q v (n/2) + 0(l); 

J 0(l+n/L) ifn < ocZ, 

^ A[H> ~ I 3Q A (n/2) + Q v (n/2) + 0(l) otherwise; 

2 The results can be extended, but the analysis is somewhat more complicated. 
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(a) 



(b) 



(c) 




Figure 6-6: Computational dag of JACOBI-ITER. (a) complete subgraph of size 9x9, (b) its 
decomposition into lines, and (c) diamond-shaped subdag of width D.(d) that is enclosed 
by two nodes u and v of distance d = 4. 



where Qa and Qa are the cache misses of the two recursive procedures JACOBI A 
and JACOBIV, and a: is a sufficiently small constant. The base case occurs when 
the two arrays fit into the cache. Solving these recurrences, we obtain Q{n) = 



6.3 Lower bound 

Finally, we prove that the number of cache misses for this problem is lower bounded 
by 11(1 + n/L + n 2 /ZL), which implies that the recursive algorithm JACOBI-REC is 
indeed optimal. 

We can use the red-blue pebble game technique described by Hong and Kung 
[25] to lower-bound the number of cache misses incurred by any algorithm com- 
puting n generations of an Jacobi-multipass filter on n elements. Hong and Kung 
use properties of the computation dag (directed acyclic graph) G given by a com- 
putation to lower-bound the number of cache misses on a two-level memory. Nodes 
in a computation dag represent operations, and edges, the data-flow of the algo- 
rithm. Nodes with no incoming edges are input and nodes with no outgoing edges 
are output. Figure 6.3(a) shows a subgraph of the computation dag given by Ja- 
COBl-lTER. A vertex-disjoint path from inputs to outputs will be called lines. The 
decomposition of Figure 6.3(a) into lines is shown in in Figure 6.3(b). The number 
of cache misses can be lower-bounded by the information speed function F G {d) of 
a dag G, which is defined as follows. 




8(1 +n/L + n 



2 /ZL) cache misses. 



□ 
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For any two vertices u and v on the same line that are at least d apart, there are 
F G [d] vertices in the dag G satisfying two properties: 

1. None of these vertices belongs to the same line. 

2. Each of these vertices belongs to a path connecting u and v. 

In the dag given by JACOBI-ITER, for example, two nodes u and v enclose a 
diamond-shaped subdag of width D.{d), where d is the distance of u and v, as 
shown in Figure 6.3(c). 

We can obtain lower bounds on the cache complexity Q using the following 
lemma which is proven in [25]. 

Lemma 16 Suppose G is a computation dag where all inputs can reach all outputs through 
vertex-disjoint paths, and its information speed function is D.{F G {d)). IfF G {d) is mono- 
tonically increasing, and F G J [d] exists, then the number of cache misses required to execute 
G is 

Q = C1(K/F G 1 (Z)) / 

where K is the total number of vertices on the vertex-disjoint paths or lines. □ 

We use Lemma 16 to prove a lower bound on the cache complexity of any algo- 
rithm computing n generations of a Jacobi multipass filter on n elements by finding 
an upper bound on F G 1 . 

Theorem 17 Any scheduling of the computation dag induced by the JACOBI-ITER algo- 
rithm on an array of size n incurs £1(1 + n/L + n 2 /LZ) cache misses. 

Proof. This theorem can be proven by applying three lower bounds: 

1. Suppose that L = 1. We can lower-bound the cache complexity using Lem- 
ma 16. Consider the subnetwork of the dag of JACOBI-ITER that includes only 
one third of the edges, as shown in Figure 6.3(b). The subnetwork has n lines 
with K = @{n 2 ) vertices. The information speed function is F{d) = 0.{d), 
since a diamond-shaped subdag of width Cl[d — 2) is enclosed by two nodes 
as illustrated in Figure 6.3(c) for d = 4. Therefore F~ l {d) = 0{d) and the 
resulting lower bound for Q is Q{n) = D.{n 2 /Z). 

At most L data items are moved into cache when a cache miss occurs. Thus, 
a first lower bound f or L > 1 is 

Q(n)=Cl(n 2 /ZL). 
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Figure 6-7: Plot of update time per element per generation for optimized iterative and 
recursive implementations of a multipass filter on a 167-MHz UltraSparc with 16kB Li- 
cache and 512kB L2-cache. 

2. The algorithm must read all @[n) inputs, which reside onfl(n/L) cache lines. 
This yields the second lower bound of Cl[n/L). 

3. The third lower bound is the trivial lower bound of Q{n) = Q(l). 



By combining these lower bounds we get Q{n) = £1(1 +n/L + n 2 /LZ). □ 



6.4 Experimental results 

We now compare optimized implementations of the iterative and the recursive 
algorithms for the simple update rule given in Equation (6.1). (The iterative algo- 
rithm uses only 2 temporary variables, and the recursive implementation uses a 
"unfolded" [18] base case.) Figure 6-7 shows a plot of the update time per element 
per generation for the two versions on a 167-MHz Sun UltraSparc with 16kB Li- 
cache and 512kB L2-cache. The update time for the recursive algorithm is not only 
faster than the iterative algorithm, it is also nearly constant, whereas the iterative 
implementation slows down with every new level of the memory hierarchy. For 
arrays that do not fit in L2-cache, the recursive implementation executes in less 
than 70% of the time of the iterative version. The gain can be even higher for out- 
of-core algorithms, because disk bandwidth is considerably less than memory or 
cache bandwidths. 
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Summary 

In this section, we have presented an optimal recursive algorithm to compute a 
multipass filter over one-dimensional data. We compared its cache complexity 
to a iterative algorithm and gave some brief empirical results for this problem. 
The recursive algorithm executes in less than 70% of the time of the iterative al- 
gorithm on problems that do not fit in L2-cache. The technique presented here 
can be extended to multidimensional stencil-filters. I expect that the advantage of 
the cache-oblivious algorithm on the multidimensional data will prove to be even 
greater. 
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Section 7 

Cache complexity of ordinary 
algorithms 



This section analyzes the cache complexity of the "ordinary" algorithms for ma- 
trix transposition, matrix multiplication, and sorting. Although optimal in the 
random-access machine model [4] and cache oblivious, these algorithms are not 
asymptotically optimal with respect to cache misses. We first prove that the num- 
ber of cache misses of algorithms with a "regular" complexity bound (as defined 
later) is asymptotically the same even if least-recently-used (LRU) is used instead 
of optimal replacement. We then show that the standard iterative algorithm to 
transpose a matrix incurs D.{n 2 ) cache misses on a n x n matrix matching the triv- 
ial upper bound of one cache miss per time step. The ordinary iterative algorithm 
to multiply two n x n matrices incurs D.{n 3 ) cache misses, which is also the worst 
possible asymptotic behavior for an 0(n 3 )-work algorithm. Many "ordinary" al- 
gorithms for sorting exit. We pick mergesort and prove that its cache complexity is 
Q((n/L) lg (n/Z)) when sorting an array of n elements, which is a factor of 8(lgZ) 
away from optimal. 

The ideal-cache model is well suited for algorithm design and upper-bound 
analyses. This comes in part from the optimal replacement strategy employed by 
the ideal-cache. 

Lower-bounding the cache complexity of an algorithm with optimal replace- 
ment is somewhat hard, since it must be proven that the optimal replacement 
strategy will do. For upper bounds, we can pick any replacement strategy we 
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want and the optimal replacement will perform as least as well as our arbitrary 
strategy. However, for lower bounds we must be more careful. We usually do 
not know which line the optimal replacement strategy would replace. The follow- 
ing analysis shows that the optimal and omniscient replacement strategy used by 
an ideal cache can be simulated efficiently by the LRU replacement strategy. The 
LRU strategy replaces the cache line whose most recent access was earliest among 
all lines in the associativity set. In fact, for algorithms with a "regular" complex- 
ity bound, LRU and optimal replacement yield the same asymptotic bounds. We 
define a cache complexity bound Q(w; Z, L) to be regular if 

Q(n- ) Z,L)=0(Q(n- ) 2Z,L)). (7.1) 

Lemma 18 Consider an algorithm that causes Q*[n;Z,L) cache misses on a problem 
of size n using a [Z,L] ideal cache. Then, the same algorithm incurs Q[n;Z,L) < 
2Q* (n; Z/2, L) cache misses on a (Z, L) cache that uses LRU replacement. 

Proof. Sleator and Tarjan [37] have shown that the cache misses on a (Z, L) cache 
using LRU replacement is (Z/(Z — Z* + 1 ) ) -competitive with optimal replacement 
on a (Z*,L) ideal if both caches start with an empty cache. It follows that the 
number of misses on a (Z, L) LRU-cache is at most twice the number of misses on 
a (Z/2, L) ideal-cache. □ 

Corollary 19 For algorithms with regular cache complexity bounds, the asymptotic num- 
ber of cache misses is the same for LRU and optimal replacement. 

Proof. This corollary follows directly from Lemma 18 and the regularity condi- 
tion. □ 

The same argument extends to a variety of other replacement strategies [11], 
including: 

flush when full: Whenever there is a cache miss and there is no space left in the 
cache, evict all lines currently in the cache (call this action a "flush"). 

clock replacement: An approximation to LRU in which a single "use bit" replaces 
the implicit (time of last access) timestamp of LRU. 

first-in, first-out: Replace the line that has been in the fast memory longest. 

random: Whenever a cache miss occurs, evict a page chosen randomly and uni- 
formly among all fast memory pages. 

We shall use Corollary 19 in the following lower-bound proofs and assume that 
the cache is handled by LRU to simplify our analyses. If an algorithm analyzed 
with LRU is regular, then the optimal strategy must also be regular. Therfore, 
according to Corollary 19 the bound derived with the LRU analysis applies to the 
ideal cache model as well. 
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7.1 Matrix multiplication 



In this section, we analyze the straightforward iterative algorithm for matrix mul- 
tiplication. We prove that it causes D. (n 3 ) cache misses when the n x n matrices 
are stored in row-major order and do not fit in cache. We further show that even if 
the matrices are stored in the order in which they are used and do not fit in cache, 
the number of cache misses is at least Q.{n 3 /L), compared to 0(n 3 /Ly/Z) for an the 
cache-optimal 0(n 3 )-work algorithm presented in Section 2. 

The simplest way to compute the product of two matrices is to evaluate the 
formula 

n 

Cij = y~ ajkbkj 
k=i 

directly, as in the following program: 

Ord-Mult(A, B, C, n) 

1 for i 4- 1 to n 

2 do for / «- 1 to n 

3 do c i} <- 0 

4 for k 4- 1 to n 

5 do Cij <- Cij + a ik bkj 

Theorem 20 The Ord-Mult algorithm for matrix multiplication uses Q{n 3 ) work when 
multiplying n x n matrices that do not fit in cache. It incurs £l(n 3 ) cache misses, when 
the matrices are stored in row-major order. Even if the matrices are stored in the order 
in which they are used, Ord-Mult incurs Cl{n 3 /L) cache misses, which is a factor of 
@{\/Z) from optimal. 

Proof. Analyzing the work of Ord-Mult(A, B) is straightforward. Since there 
are three nested loops, each of which performs n iterations, the work is 0(n 3 ). 
Since the algorithm cannot access more than 0(1) elements in constant time, 0(n 3 ) 
is also an upper bound on the number of cache misses for this algorithm. 

First, we assume that both matrices are stored in row-major order (Figure 2- 
1(a)): 

C <- A ■ B 



The number of cache misses can be lower-bounded by counting only the misses 
caused by reading matrix B. In the ith iteration of the outer loop, the elements of 
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the z'th row of C are computed. While i is fixed the inner two loops iterate over all 
n 2 values of k and /, reading all elements of matrix B column by column. 

Assuming that the matrix does not fit in cache, e.g. n » Z/L, the LRU replace- 
ment strategy overwrites lines of matrix B before they can be reused. Therefore, the 
number of misses on matrix B to compute one element of matrix C is Cl[n). Since 
C contains n 2 elements, the algorithm causes Q(n 3 ) cache misses. It follows from 
Corollary 19 that even with an optimal replacement strategy, ORD-MULT incurs 
Cl{n 3 ) cache misses. 

ORD-MULT(A, B) does not exhibit good cache behavior. Accesses to the same 
element, or at least to the same cache line, are far apart. The spatial locality of the 
memory accesses can be improved by changing the memory layout of the matrices: 
The previous analysis showed that the accesses to matrix B alone cause 0(n 3 ) cache 
misses. The problem is that matrix B is stored in row-major order, but accessed 
columnwise. Assume that the memory layout for B is changed from row-major to 
column-major order (Figure 2-l(b)): 

C <- A ■ B 



Now, both matrices are accessed in the order in which they are stored. As long as 
the cache can provide a single line for each of A, B, and C, for each cache miss, 
the following L — 1 accesses are cache hits. Hence, the number of cache misses 
is 0(n 3 /L), which is a factor of 0(L) improvement over the previous algorithm. 
This improvement comes with the disadvantage that the matrices are not stored 
uniformly and it is still a factor of 0(\/Z) away from the cache-optimal algorithms 
shown in Sections 1 and 2. □ 



7.2 Matrix transposition 

In this section, we argue that the iterative algorithm for matrix transposition causes 
D.{n 2 ) cache misses on a n x n matrix, when the matrix is stored in row- or column- 
major order (Figure 2-l(a,b)). This is a factor of &[L) more cache misses than the 
cache-optimal algorithm presented in Section 3. 

The ordinary algorithm for matrix transposition walks through the matrix row 
by row and swaps elements: 
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Ord-Transpose(A, B,n) 

1 for i <— 1 to n 

2 do for j «— 1 to n 

3 do by <— aji 

Theorem 21 Tfte ORD-TRANSPOSE algorithm for matrix multiplication uses Q{n 2 ) work 
and incurs Q.{n 2 ) cache misses, when transposing a n x n matrix that does not fit into 
cache. 

Proof. Transposing a matrix is equivalent to changing the memory layout from 
row- to column-major layout or vice versa. Here, we show that accessing in column- 
major order a matrix stored in row-major layout causes D.{n 2 ) cache misses. After 
Z/L cache misses, the cache is filled and lines must be evicted. The LRU strat- 
egy replaces the lines in the same order in which they are read. Therefore, after 
n accesses, when a line could be reused, it has been evicted from cache by LRU 
replacement. Thus, all n 2 accesses are cache misses. Since Q.(n 3 ) is regular, it fol- 
lows from Corollary 19 that ORD-TRANSPOSE incurs Cl{n 2 ) cache misses in the 
ideal-cache model. □ 



7.3 Mergesort 

We have just shown that divide-and-conquer algorithms presented in Sections 2 
and 3 for matrix multiplication and matrix transposition incur fewer cache misses 
than their iterative counterparts. In this section, we show that divide-and-conquer 
algorithms are not per se cache-optimal. Specifically, we show that Mergesort [16, 
p. 13] incurs Cl[[n/L) lg (n/Z) ) cache misses for an input of size n, which is a fac- 
tor of O(lgZ) more cache misses than the cache-optimal algorithms presented in 
Sections 4 and 5. 

Mergesort is a recursive sorting algorithm that divides the input sequence in 
two parts, sorts them recursively and then merges the two sorted subsequences 
into one sorted sequence. The following pseudocode is the standard description 
of mergesort and can be found in a variety of textbooks [16, 34]. 

Mergesort (A, p,r) 

1 if p < r 

2 thenq <- [{p + r)/2\ 

3 Mergesort (A, p, q) 

4 Mergesort (A, q + l,r) 

5 Merge (A, p, q, r) 
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We assume that the input array A of length n is stored in Q[n) contiguous 
memory locations. MERGESORT uses an auxiliary procedure MERGE (A, p, q, r) that 
merges two sorted subarrays A[p . .q] and A[q + 1 . . r] into a single sorted subarray 
that replaces the current subarray A[p . .r]. Merging two subarrays of length n/2 
uses @{n) work and causes 0{n/L) cache misses assuming that Z/L > 3, since 
the @(n) data items can be accessed in linear order and each cache miss brings L 
elements into cache. 

The work of MERGESORT is 8(nlgn), which is optimal in the random-access 
machine model [16, p. 172]. Although mergesort is a divide-and-conquer algo- 
rithm, its cache complexity is not asymptotically optimal. 

Lemma 22 MERGESORT incurs Q.{[n/L) \g{n/Z)) cache misses for an input of size n. 
Proof. The cache complexity of MERGESORT can be described by the recurrence: 

Q(n) = i a[nlL] Xn<*Z, 

^ [ ' \ 2Q[n/2) +D(n/L) otherwise, v ' ; 

where a is a sufficiently small constant. The base case arises when the @{n) ele- 
ments fit into the cache. Sorting Q{n) elements requires ®[n) auxiliary storage for 
the merging procedure. In the recursive case where n > ocZ, two subproblems of 
half the size are solved and then merged together. After line 3 of MERGESORT is 
executed, LRU replacement will have evicted most of the n/2 data items used in 
the first recursive call. Thus, Q.(n) data elements must be read from contiguous 
memory locations incurring D.(n/L) cache misses. The solution of Equation (7.2) 
is£>((tt/L)lg(rc/Z)). □ 



Summary 

In this section, we have shown that the cache complexity of the ordinary algo- 
rithms for matrix transposition, matrix multiplication, and sorting are not asymp- 
totically optimal. We have proven in Corollary 19 that the optimal replacement 
strategy can be efficiently simulated by the LRU replacement strategy. For algo- 
rithms with regular complexity bounds LRU and optimal replacement yield the 
same asymptotic bounds. Since it is easier to analyze the caching behavior of an 
algorithm when we understand what the replacement strategy does, this Corollary 
often helps to simplify the analyses. 
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Section 8 



Other cache models 



In this section we show that cache-oblivious algorithms designed in the two-level 
ideal-cache model can be efficiently ported to other cache models. We show that 
algorithms with regular complexity bounds (Equation (7.1)) (including all algo- 
rithms heretofore presented) can be ported to less-ideal caches incorporating least- 
recently-used (LRU) or first-in, first-out (FIFO) replacement policies [24, p. 378]. 
We argue that optimal cache-oblivious algorithms are also optimal for multilevel 
caches. Finally we present simulation results proving that optimal cache-oblivious 
algorithms satisfying the regularity condition are also optimal (in expectation) in 
the previously studied SUMH [5, 42] and HMM [1] models. Thus, all the algorith- 
mic results in this thesis apply to these models, matching the best bounds previ- 
ously achieved. 



8.1 Two-level models 

Many researchers, such as [3, 25, 43], employ two-level models similar to the ideal- 
cache model, but without an automatic replacement strategy. In these models, 
data must be moved explicitly between the the primary and secondary levels "by 
hand." 

We now show that optimal algorithms in the ideal-cache model whose cache 
complexity bounds are regular (Equation (7.1)) can be ported to these models 
to run using optimal work and incurring an optimal expected number of cache 
misses. Since previous two-level models do not support automatic replacement, 
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to port a cache-oblivious algorithms to them, we implement an LRU (or FIFO) re- 
placement strategy in software. 

Lemma 23 A (Z, L) LRU cache (or FIFO-cache) can be maintained using 0{Z) primary 
memory locations such that every access to a cache line in primary memory takes O(l) 
expected time. 

Proof. Given the address of the memory location to be accessed, we use a 2- 
universal hash function [29, p. 216] to maintain a hash table of cache lines present 
in the primary memory The Z/L entries in the hash table point to linked lists in 
a heap of memory containing Z/L records corresponding to the cache lines. The 
2-universal hash function guarantees that the expected size of a chain is O(l). All 
records in the heap are organized as a doubly linked list in the LRU order (or singly 
linked for FIFO). Thus, the LRU (FIFO) replacement policy can be implemented in 
O(l) expected time using 0(Z/L) records of O(L) words each. □ 

Theorem 24 An optimal cache-oblivious algorithm with a regular cache-complexity bound 
can be implemented optimally in expectation in two-level models with explicit memory 
management. □ 

Consequently, our cache-oblivious algorithms for matrix multiplication, matrix 
transposition, FFT, and sorting are optimal in two-level models with explicit mem- 
ory management. 



8.2 Multilevel ideal caches 

We now show that optimal cache-oblivious algorithms also perform optimally in 
computers with multiple levels of ideal caches. Moreover, Theorem 24 extends to 
multilevel models with explicit memory management. 

The ( (Zi, Li ), (Z 2 , L 2 ), . . . , (Z r , L r ) ) ideal-cache model consists of an arbitrar- 
ily large main memory and a hierarchy of r caches, each of which is managed by 
an optimal replacement strategy. The model assumes that the caches satisfy the 
inclusion property [24, p. 723], which says that for i = 1, 2, . . . , r — 1, the values 
stored in cache i are also stored in cache i + 1. The performance of an algorithm 
running on an input of size n is measured by its work complexity W[n) and its 
cache complexities Q ; (n; Z ; , L,) for each level i = 1, 2, . . . , r. 

Theorem 25 An optimal cache-oblivious algorithm in the ideal-cache model incurs an 
asymptotically optimal number of cache misses on each level of a multilevel cache with 
optimal replacement. 
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Proof. The theorem follows directly from the definition of cache obliviousness 
and the optimality of the algorithm in the two-level ideal-cache model. □ 

Theorem 26 An optimal cache-oblivious algorithm with a regular cache-complexity bound 
incurs an asymptotically optimal number of cache misses on each level of a multilevel cache 
with LRU or optimal replacement. 

Proof. Follows from Corollary 19 and Theorem 25. □ 

8.3 The SUMH model 

In 1990 Alpern et al. [5] presented the uniform memory hierarchy model (UMH), a 
parameterized model for a memory hierarchy. In the UMH^jjj model, for integer 
constants a, p > 1, the size of the z'th memory level is Z ; = ap 2 ' and the line 
length is L, = p'. A transfer of one p'-length line between the caches on level I and 
I + 1 takes p l /b{l) time. The bandwidth function b[l) must be nonincreasing. The 
processor can access the cache on level 1 in constant time per access. An algorithm 
given for the UMH model must include a schedule that, for a particular set of 
input variables, tells exactly when each block is moved along which of the buses 
between caches. Work and cache misses are folded into one cost measure T{n). 
Alpern et al. prove that an algorithm that performs the optimal number of cache 
misses at all levels of the hierarchy does not necessarily run in optimal time in 
the UMH model, since scheduling bottlenecks can occur when all buses are active. 
In the more restrictive SUMH model [42], however, only one bus is active at a 
time. Consequently, we can prove that optimal cache-oblivious algorithms run in 
optimal expected time in the SUMH model. 

Lemma 27 A cache-oblivious algorithm with W{n) work and Q{n; Z, L) cache misses on 
a (Z, L)-ideal cache can be executed in the SUMH aiPi m) model in expected time 



r-l 




where Z ; = ap 2t , L, = p\ and Z r is big enough to hold all elements used during the 
execution of the algorithm. 

Proof. Use the memory at the z'th level as a cache of size Z, = ap 2 ' with line length 
Li = p 1 and manage it with software LRU described in Lemma 23. The rth level is 
the main memory, which is direct mapped and not organized by the software LRU 
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mechanism. An LRU cache of size 8(Z,) can be simulated by the zth level, since 
it has size Z,. Thus, the number of cache misses at level i is 2Q(n; 8(Z,), L,-), and 
each takes p l /b{i) time. Since only one memory movement happens at any point 
in time and there are 0{W{n)) accesses to level 1, the lemma follows by summing 
the individual costs. □ 

Lemma 28 Consider a cache-optimal algorithm whose work on a problem of size n is 
lower-bounded by W* [n] and whose cache complexity is lower-bounded by Q* (n; Z,L) on 
an (Z, L) ideal-cache. Then, no matter how data movement is implemented in SUMH^p^^), 
the time taken on a problem of size n is at least 




where Z ; = ap 2 ', L, = p 1 and Z r is big enough to hold all elements used during the 
execution of the algorithm. 

Proof. The optimal scheduling of the data movements does not need to obey the 
inclusion property, and thus the number of z'th-level cache misses is at least as large 
as for an ideal cache of size Yl\=\ = 0[Zi). Since Q*{n,Z,L) lower-bounds the 
cache misses on a cache of size Z, at least Q*{n, 8(Z,), L,-) data movements occur 
at level i, each of which takes p'/b{i) time. Since only one movement can occur at a 
time, the total cost is the maximum of the work and the sum of the costs at all the 
levels. □ 

Theorem 29 A cache-oblivious algorithm that is optimal in the ideal-cache model and 
whose cache complexity is regular can be executed in the SUMH a/P/ m) model in optimal 
expected time. 

Proof. The theorem follows directly from regularity and Lemmata 27 and 28. □ 



8.4 The HMM model 

Aggarwal, Alpern, Chandra, and Snir [1] proposed the hierarchical memory model 
(HMM) in which an access to location x takes f{x) time. The authors assume that 
/ is a monotonically nondecreasing function, usually of the form [log x] or \x a ] . 

Lemma 30 Consider a cache-oblivious algorithm with W{n) work and Q{n\ Z, L) cache 
misses on a [Z,L) ideal cache. Let Z\ < Z 2 < ■ ■ ■ < Z r be positive integers such that a 
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cache of size Z r can hold all of the data used during the execution of the algorithm. Then, 
the algorithm can be executed in the HMM model with cost function f in expected time 

r 

T[n)=o(w[n)f[si) + 2if[s i )Q[n ] e[Z i ),lj) , 

i=2 

where S\ = 0{Z{], S2 = s\ + 0{Z 2 ), . . ., s r = s r _i + 0{Z r ). 

Proof. Using Lemma 23 we can simulate a ( [Z\, 1 ), (Z 2 , 1 ), . . . , [Z r , 1 ) ) LRU cache 
in the HMM model by using locations 1, 2, . . . , S\ to implement cache 1, locations 
Si + 1, S\ + 2, . . . , s 2 to implement cache 2, etc. The cost of each access to the ith 
cache is at most /(s,-). Cache 1 is accessed at most W[n) times, cache 2 is accessed 
at most Q(n; 0(Z 2 ), 1) times, and so forth. The lemma follows. □ 

Lemma 31 Consider a cache-optimal algorithm whose work on a problem of size n is 
lower-bounded by W*[n) and whose cache complexity is lower-bounded by Q*[n;Z,L) 
on an (Z, L) ideal cache. Then, no matter how data movement is implemented in an HMM 
model with cost function f, the time taken on a problem of size n is at least 

r 

T{n] = Cl(w*{n) + ^(/(Z^ - 1) -/(Z*_ 2 - D)Q*(n;Z,-, 1)) 
i=\ 

for any Z 0 = 1 < Z\ < ■ ■ ■ < Z r such that a cache of size Z r can hold all of the data used 
during the execution of the algorithm. 

Proof. The memory of the HMM model can be viewed as a cache hierarchy with 
arbitrary parameters Z 0 = 1 < Z\ < ■ ■ ■ < Z r , where the memory elements are 
mapped to fixed locations in the caches. The processor works on elements in the 
level 0 cache with 0(1) cost. The first Z\ — \ elements of the HMM memory are 
kept in the level 1 cache, the first Z 2 — 1 elements in the level 2 cache, etc. One el- 
ement in each cache is used as a "dynamic entry" which allows access to elements 
on higher levels. Accessing a location at level i is then done by incorporating the 
memory item in the dynamic element of each of the caches closer to the proces- 
sor. This "cache hierarchy" obeys the inclusion principle, but it does not do any 
replacement. Memory elements are exchanged — as in HMM — by moving them to 
the processor and writing them back to their new location. 

If we charge /(Z,-_i — 1) — /(Z,_ 2 — 1) to a cache miss on cache i, an access to 
element at position x in cache at level k costs X;=i/(Z*'-i — 1) — /(Z;_ 2 — 1) = 
/(Zjt_i — 1) — f{0), which is at most f[x). Thus, the access cost for accessing el- 
ement x is the same in the HMM as in this "cached" HMM model. The cost T[n) 
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of an algorithm in the HMM model can be bounded by the cost of the algorithm in 
the multilevel model, which is at least 

r 

T{n)=d(w{n)+^(f{Z i -l)-f{Z i ^-l))Q{n;Z i/ l)) . 

Since W[n) > W*[n) and Q{n\Z if 1) > Q*{n;Z ir 1), the lemma follows. □ 

Theorem 32 A cache-oblivious algorithm that is optimal in the ideal-cache model and 
whose cache complexity is regular can be executed in optimal expected time in the HMM 
model, if the cost function is monotonically nondecreasing and satisfies f{2x) = @{f[x)). 

Proof Assume that the cache at level r is big enough to hold all elements used 
during the execution of the algorithm. We choose Z lr • • • ,Z r such that 2/(Z, i 
1) < Zj ■ — 1 = 0(Z,_! — 1) for all 1 < i < r. Such a sequence can be computed 
given that / is monotonically nondecreasing and satisfies f[2x) = 0(/(x) ). 

We execute the algorithm as described in Lemma 30 on the HMM model with 
2Zi, 2Z 2/ . . . , 2Z r . The cost of warming up the caches is Y-\<i<o(z r ) J : (*') = ®{Zrf[Z r )) 
which is asymptotically no greater than the cost of the algorithm even if it accesses 
each input item just once. The result follows from Lemmata 18, 30 and 31. □ 

Summary 

One strength of the ideal-cache model, compared to other models studied in the lit- 
erature, is that designing and analyzing algorithms is easier. But this section shows 
that the assumptions of the ideal-cache model are not stronger than the assump- 
tions of two hierarchical memory models in the literature. Specifically, we have 
shown that optimal cache-oblivious algorithms in the ideal-cache model are also 
optimal in the hierarchical memory model (HMM) [1] and in the serial uniform 
memory hierarchy (SUMH) model [5, 42]. 

Due to its simplifications, the ideal-cache model falls short of modeling some of 
the idiosyncrasies of a real-world memory hierarchy. It ignores issues such as con- 
flict misses, and has only one level of caching. In developing recursive algorithms, 
however, we have found that these additional complications are comparatively 
easy to deal with once an algorithm has been designed in the ideal model. 
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Section 9 



Related work 



In this section, we discuss the origin of the notion of cache obliviousness. We also 
give an overview of other hierarchical memory models. 

Our research group at MIT noticed as far back as 1994 that divide-and-conquer 
matrix multiplication was a cache-optimal algorithm that required no tuning, but 
we did not adopt the term "cache-oblivious" until 1997. This matrix-multiplication 
algorithm, as well as a cache-oblivious algorithm for LU-decomposition without 
pivoting, eventually appeared in [9]. Shortly after leaving our research group, 
Toledo [40] independently proposed a cache-oblivious algorithm for LU-decom- 
position, but with pivoting. For n x n matrices, Toledo's algorithm uses 8(n 3 ) 
work and incurs 6(1 + n 2 /L + n^/Ly/Z) cache misses. More recently, our group 
has produced an FFT library called FFTW [20], which in its most recent incar- 
nation [19], employs a register-allocation and scheduling algorithm inspired by 
our cache-oblivious FFT algorithm. The general idea that divide-and-conquer en- 
hances memory locality has been known for a long time [36]. 

Previous theoretical work on understanding hierarchical memories and the 
1/ O-complexity of algorithms has been studied in cache-aware models lacking an 
automatic replacement strategy. Hong and Kung [25] use the red-blue pebble game 
to prove lower bounds on the I /O-complexity of matrix multiplication, FFT, and 
other problems. The red-blue pebble game models temporal locality using two 
levels of memory. The model was extended by Savage [33] for deeper memory 
hierarchies. Aggarwal and Vitter [3] introduced spatial locality and investigated 
a two-level memory in which a block of P contiguous items can be transferred in 
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one step. They obtained tight bounds for matrix multiplication, FFT, sorting, and 
other problems. The hierarchical memory model (HMM) by Aggarwal et al. [1] 
treats memory as a linear array, where the cost of an access to element at location x 
is given by a cost function f{x). The BT model [2] extends HMM to support block 
transfers. The UMH model by Alpern et al. [5] is a multilevel model that allows 
I/O at different levels to proceed in parallel. Vitter and Shriver introduce paral- 
lelism, and they give algorithms for matrix multiplication, FFT, sorting, and other 
problems in both a two-level model [43] and several parallel hierarchical mem- 
ory models [44]. Vitter [41] provides a comprehensive survey of external-memory 
algorithms. 
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Section 10 



Wir stehen selbst enttauscht und sehn betroffen 
Den Vorhang zu und alle Fragen offen. [...] 
Verehrtes Publikum, los, such dir selbst den SchluB! 
Es muB ein guter da sein, muB, muB, muB! 

We feel deflated too. We too are nettled 
To see the curtain down and nothing settled. [...] 
You write the happy ending to the play! 
There must, there must, there's got to be a way! 

BERTOLD BRECHT, Der gute Mensch von Sezuan, 1940 



Conclusion 



This thesis has introduced the notion of cache obliviousness and has presented 
asymptotically optimal cache-oblivious algorithms for fundamental problems. Fig- 
ure 10 gives an overview of the known efficient cache-oblivious algorithms, most 
of which are described in this thesis. Two that we have not discussed are matrix 
addition and LUP-decomposition. For matrix addition, a simple iterative algo- 
rithm turns out to be cache-optimal if the matrix elements are read in the same 
order in which they are stored in memory. The algorithm for LUP-decomposition 
is due to Toledo [40], but it uses cache-aware algorithms as subprocedures. By ap- 
plying the cache-oblivious algorithms presented here, however, his algorithm can 
be converted into a cache-oblivious one. 

The remainder of this section outlines research questions related to cache oblivi- 
ousness. Section 10.1 discusses the engineering task of implementing cache-oblivi- 
ous algorithms. Section 10.2 discusses cache-oblivious data structures and briefly 
presents a cache-oblivious data structure for static binary search trees. Section 10.3 
raises two theoretical questions about the general power of cache-oblivious algo- 
rithms. Section 10.4 discusses divide-and-conquer as a programming strategy and 
the tools needed to help programmers to write recursive programs. In Section 10.5, 
I argue that because divide-and-conquer works well with cache hierarchies and 
also with parallel computers, the coming revolution of shared-memory multipro- 
cessors will make this design paradigm of paramount importance. 
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Figure 10-1: Overview of the known cache-oblivious algorithms. Except for matrix addi- 
tion (f ) and LUP-decomposition (J), all these algorithms are presented in this thesis. 

10.1 Engineering cache-oblivious algorithms 

The job is not done after an efficient algorithm has been designed in the ideal- 
cache model. The software-engineering task of programming the algorithm on a 
real machine remains to be done. This task often involves coping with the less- 
than-ideal behavior of real caches. Nevertheless, if the original algorithm in the 
ideal-cache model exploits locality effectively a program based on the algorithm 
can usually be made to run efficiently in practice. If the algorithm fails to exploit 
locality in the ideal-cache model, the algorithm will be slow no matter what the 
real-world computer environment looks like. 

The trend in architecture is towards bigger caches with steeper hierarchies and 
towards new cache organizations which employ more "intelligent" algorithms to 
use the cache memory more effectively. But even when caches get more intelli- 
gent, the algorithm designer retains responsibility to ensure that frequently ac- 
cessed data has the opportunity to reside in cache. 

Both cache-aware and cache-oblivious strategies can be used to achieve good 
caching behavior of an algorithm. This thesis has shown that optimal cache-ob- 
livious algorithms exist which have the same cache complexity as optimal cache- 
aware algorithms. But how do these two strategies compare in practice? How 
much faster is a cache-aware algorithm optimized for a given architecture than a 
cache-oblivious algorithm that solved the same problem? 
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Initial measurements I have taken indicate that cache-oblivious algorithms can 
rival the performance of hand-tuned cache-aware code, but in general cache-aware 
programs are faster. I hope to quantify this difference, as well as resolve other em- 
pirical questions. How do more levels of caching affect the difference between 
cache-aware and cache-oblivious algorithms? By how much do cache-aware algo- 
rithms slow down when executed on hardware they are not optimized for? An- 
swers to these questions will become increasingly important as cache hierarchies 
become more pronounced. 

10.2 Cache- oblivious data structures 

This section discusses cache-oblivious data structures and briefly presents a cache- 
oblivious data structure for static binary search trees. 

As there are cache-oblivious algorithms, there are cache-oblivious data struc- 
tures. The blocked layout (Figure 2-1 (c)), for example, is cache aware. To optimize 
it for a certain cache, the line length must be known. The bit-interleaved layout 
(Figure 2-1 (d)), however, is cache oblivious and has the same asymptotic behavior 
as the blocked layout for matrix multiplication. Other cache oblivious layouts for 
matrices exist like the Morton or Hilbert layouts discussed in [12, 13, 18]. 

Different data layouts can greatly affect the asymptotic behavior of an algo- 
rithm. For cache-optimal matrix multiplication, as discussed in Section 2, the tall 
cache requirement can be relaxed if matrices are stored in blocked (Figure 2-1 (c)) 
or bit-interleaved order (Figure 2-l(d)). In Section 7.1 we have shown that the 
number of cache misses for the ordinary matrix multiplication algorithm can be 
reduced by a factor of O(L) by choosing a different data layout. 

Can the idea of cache obliviousness be extended to data structures? Do efficient 
cache-oblivious data structures exist for dynamic data structures, such as linked 
lists, heaps, or trees? Although I do not yet know the answer to this question, I 
have been able to devise a cache-oblivious layout for static binary search trees that 
is 0(l)-competitive with the performance of B-Trees [28], which are used in file 
systems and other out-of-core applications because of their low cache complexity. 

Figure 10-2 shows the cache-oblivious layout for a complete binary search tree 
of height 4. Let T be a complete binary tree of height h = 0(lg n), where n is the 
number of elements in the tree. To find the layout, divide T at level [h /2J , which 
separates T into subtree T 0 (top [h /2J levels) and k < 2 ^/ 2 J subtrees having height 
at most \h/2\ . The cache-oblivious data layout C{T) of T is defined recursively as 
follows. 

C{T)=C{To) || £(Ti) || •••II C{T k ), 
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memory 

Figure 10-2: Cache-oblivious layout of a binary tree of height 4 with 15 elements 1, 
2, . . . , 15. The values stored in the nodes of the tree are shown in the order in which they 
are stored. Pointers to left children are shown in grey and pointers to right children in 
black. 

where || is the concatenation operator. The base case, when the tree has only one 
node, is trivial. The cache complexity of finding an element in this data structure 
on a (Z, L) ideal-cache is 0(lg L «), which is asymptotically equivalent to the per- 
formance of a B-Tree. Making this layout strategy work for dynamic search trees 
is a high research priority. 

10.3 Complexity of cache obliviousness 

In this section, we discuss whether a separation theorem can be proven, showing 
that certain problems can only be solved cache-optimally by a cache-aware algo- 
rithm. We also discuss whether a simulation result can be proven that bounds the 
advantage of cache-aware algorithms over cache-oblivious algorithms. 

We know now that many optimal cache-oblivious algorithms exist. But, how 
powerful are cache-oblivious algorithms compared to cache-aware algorithms in 
general? 

Separation: Is there a separation in asymptotic complexity between cache-aware 
and cache-oblivious algorithms? 

It appears that cache-aware algorithms should be able to use caches better than 
cache-oblivious algorithms since they have more knowledge about the system they 
are running on. But so far, I have not found a cache-aware algorithm that has better 
asymptotic behavior than a well-designed cache-oblivious algorithm. Neverthe- 
less, I do believe a seperating problem exists. I conjecture that for such a seperat- 
ing problem, the best cache-oblivious algorithm has a factor of Q(lg Z) more cache 
misses than the best cache-aware algorithm. 

Simulation: Given a class of optimal cache-aware algorithms to solve a single 
problem, can we construct a good cache-oblivious algorithm that solves the 
same problem? 
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I believe that the gap between cache-aware and cache-oblivious algorithms (if it 
exists) is not bigger than a factor of O(lgZ) difference. Perhaps this result can 
be proven by using simulation techniques to convert a class of cache-aware algo- 
rithms into a cache-oblivious algorithm. I have not yet had much success in this 
line of research, however. 

10.4 Compiler support for divide- and- conquer 

This section discusses how new compiler techniques can help to ease the program- 
ming of divide-and-conquer algorithms. Most algorithms given in this thesis are 
divide-and-conquer algorithms. Conventional wisdom says that recursive proce- 
dures should be converted into iterative loops in order to improve performance [8]. 
While this strategy was effective ten years ago, many recursive programs now ac- 
tually run faster than their iterative counterparts. So far most of the work by archi- 
tects and compiler writers is concentrated on loop-based iterative programs. Their 
tools are often not appropriate for recursion and divide-and-conquer programs. 

For a divide-and-conquer algorithm to be efficient, the base case must be ef- 
ficiently coded. Coding recursion with a simple "unit" base case is usually easy 
for a programmer, but then the overhead of the recursive implementation can be 
substantial. To get full performance out of a recursive algorithm, it is necessary 
to coarsen the base case of recursion (a transformation called "unfolding" in [18]), 
which is analogous to loop unrolling. Coarsening of base cases is motivated by 
the observation that for many recursive algorithms, the overhead of recursion is 
often in the lowest few levels, near the leaves. With a branching factor of 2, for 
example, 97% of the recursive function calls are in the bottom 5 levels of recursion. 
The proportion is even higher for branching factors greater than 2. 

Typically, a variety of coarsened base cases must be written, making it hard 
to code by hand. Can a compiler effectively generate coarsened base cases? This 
problem is much like loop-unrolling, which is already done by compilers. 

Matteo Frigo, a member of our research group, and Steve Johnson, also at MIT, 
have implemented a discrete Fourier transform library FFTW [20] that incorpo- 
rates a cache-oblivious algorithm with a specialized compiler to generate coars- 
ened base cases. I believe that parts of their technique can be extended to general 
divide-and-conquer algorithms. 

FFTW also employs an adaptive runtime execution, which chooses base cases 
during an initialization phase of the program. This strategy is effective when the 
question of which of several coarsened base cases yields the fastest results on a 
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given architecture cannot be determined at compile time. An adaptive execution 
strategy allows the compiler to produce several distinct base-case implementa- 
tions at different granularities and with different strategies, and then at runtime 
initialization, choose the fastest for the particular machine by timing the various 
alternatives. Benchmarks performed on a variety of platforms show that FFTW's 
performance is typically superior to that of other publicly available FFT software, 
and it rivals or is better than many hand-coded vendor libraries. 

10.5 The future of divide- and- conquer 

Shared-memory multiprocessors are now available as deskside workstations and 
will appear in desktop PC's in the next two years and in mobile laptops within five 
years. Divide-and-conquer algorithms seem to be a perfect match for these parallel 
machines in which the technologies of parallelism and caching are converging. 

In a shared-memory multiprocessor machine, multiple processors, each having 
its own cache, work together to solve problems faster, communicating through a 
single shared memory. 

Our research group discovered, while working on the parallel programming 
language Cilk [39, 9], that divide-and-conquer programs work well with shared- 
memory multiprocessors. In Cilk a function can be "spawned", making it logically 
parallel to the spawning procedure. Since the Cilk scheduler decides at runtime 
whether two logically parallel functions are actually executed in parallel, a Cilk 
program is processor oblivious. It can be effectively executed on many processors, 
as long as the problem has enough inherent parallelism. Rugina and Rinard [32] 
have experimented with automatic parallelization from C to Cilk and achieved 
good speedups on divide-and-conquer programs. 

Recursive calls can often be replaced by recursive spawns, which allow the chil- 
dren to work in parallel. Once the division phase is complete, the subproblems are 
usually independent and can therefore be solved in parallel. Our experiments with 
Cilk show that divide-and-conquer algorithms scale well and have good memory 
behavior on a parallel machine. The number of cache misses of a Cilk program can 
be upper-bounded using the cache complexity of its C elision (the Cilk program 
without the parallel keywords) as shown in [9, 10]. 

Can we design algorithms which are optimal with respect to work, parallelism, 
and cache complexity but which are also cache oblivious and processor oblivious? 
I believe that resource-oblivious versions of the algorithms given in this thesis can 
be proven to satisfy all three optimality requirements. 

Small shared-memory multiprocessors are readily avaiable: A 4-processor ma- 
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chine costs less than $20,000 [26]. Most of these machines are designed to be 
servers, but workstations intended to be used by a single user are starting to ap- 
pear [35]. These machines will become more common over the next few years, and 
it is expected that we will see a shared-memory multiprocessor-on-a-chip within a 
few years [23, 27]. Writing efficient parallel programs is considered hard. Caching 
problems are more pronounced in these machines than they are in single-processor 
machines. Memory hierarchies will be bigger and steeper in the future, and cache 
misses will be more expensive. The new Alpha 21264 chip [14], for example, 
can deliver 2 words from Ll-cache in one cycle, but it takes around 100 cycles 
to fetch from main memory. Divide-and-conquer seems to provide a way to write 
processor- and cache-oblivious algorithms, which will help to ease programming 
on the future machines. 
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